In [2]:
import pandas as pd
import numpy as np
import soundfile as sf

from musicntd.model.current_plot import *
import musicntd.data_manipulation as dm
import musicntd.autosimilarity_segmentation as as_seg
import musicntd.tensor_factory as tf
import musicntd.model.features as features
import musicntd.scripts.hide_code as hide
import nn_fac.ntd as NTD

This notebook aims at studying different representations for music, and at finding the most interesting one for our context (description of the signal in NTD).

Here, we will compare:

  • STFT: Short-Time Fourier Transform,
  • CQT: Constant-Q Transform,
  • PCP: Pitch-Class Profiles, or Chromas.

This notebook is organised as follows:

  • Firstly, we will plot these representations on a song (1.wav from RWC Pop),
  • Secondly, we will look at NTD results on these previously computed representations and segment them, presenting two segmentation methods at the same time,
  • Finally, we will decompose and segment the entire RWC Pop dataset with fixed parameters, so as to compare these representations quantitatively.

Visualization on a song

Chosen song: 1.wav from RWC Pop

In [4]:
dataset_path = "C:\\Users\\amarmore\\Desktop\\Audio samples\\RWC Pop\\Entire RWC"
song_number = 1
song_path = dataset_path + "\\{}.wav".format(song_number)

# Choice of the annotation
annotations_type = "MIREX10"
annotations_folder = "C:\\Users\\amarmore\\Desktop\\Audio samples\\RWC Pop\\annotations\\{}\\".format(annotations_type)
annotation_path = annotations_folder + dm.get_annotation_name_from_song(song_number, annotations_type)

Parameters of the representations

In [5]:
hop_length = 512
n_fft = hop_length * 4
the_signal, sampling_rate = sf.read(song_path)
hop_length_seconds = hop_length/sampling_rate

STFT

In [7]:
stft_spec = features.get_spectrogram(the_signal[:,0], sampling_rate, 
                                     feature="stft", n_fft = n_fft, hop_length = hop_length)
plot_me_this_spectrogram(stft_spec, invert_y_axis = False, x_axis = "Time (in number of frames)", y_axis = "Frequency (in indexes of bandwidths)")

In this example, low frequencies seem to dominate the representation. This is related to acoustic properties of the human ear: at the same sound intensity, high frequencies are perceived as louder than low ones. Hence, to be perceived equally loud, low frequencies are usually played at a higher intensity than high frequencies.

Constant-Q Transform

In [5]:
cqt_spec = features.get_spectrogram(the_signal[:,0], sampling_rate, 
                                     feature="cqt", n_fft = n_fft, hop_length = hop_length)
plot_me_this_spectrogram(cqt_spec, invert_y_axis = False, x_axis = "Time (in number of frames)", y_axis = "Index of Constant-Q bandwidth")

The CQT seems to capture more mid-frequency information than the STFT, which, empirically, seems desirable.

Note that the 0 on the y-axis refers to the first Constant-Q bandwidth and not to a "note 0". Bandwidths are aligned with MIDI notes, and the first bandwidth (0) is the 24th MIDI note (C1). Hence, bandwidth indexes should be offset by 24 to be converted into the MIDI scale.
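
As an illustration of this offset, here is a minimal sketch of the conversion (the helper below is hypothetical; `librosa.midi_to_note` is assumed to be available in the environment):

import librosa

def cqt_bandwidth_to_note(bandwidth_index, midi_offset=24):
    # Hypothetical helper: bandwidth 0 corresponds to MIDI note 24 (C1).
    return librosa.midi_to_note(bandwidth_index + midi_offset)

cqt_bandwidth_to_note(0)   # 'C1'
cqt_bandwidth_to_note(36)  # 'C4', three octaves above the first bandwidth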

Pitch-class profiles/Chromas

In [7]:
pcp_spec = features.get_spectrogram(the_signal[:,0], sampling_rate, 
                                     feature="pcp", n_fft = n_fft, hop_length = hop_length)
plot_me_this_spectrogram(pcp_spec, invert_y_axis = False, x_axis = "Time (in number of frames)", y_axis = "Note index (in pitch-class, semi-tone spaced)")

PCP represent the harmonic content in the song, and discard the percussive content.

In this implementation, PCP are computed from the CQT of the signal.
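
For reference, a chromagram of this kind could be obtained directly with librosa (a minimal sketch; the parameters used inside `features.get_spectrogram` may differ):

import librosa

# Chroma features folded from a CQT, one column per hop of 512 samples.
pcp_librosa = librosa.feature.chroma_cqt(y=the_signal[:, 0], sr=sampling_rate,
                                         hop_length=hop_length)
pcp_librosa.shape  # (12, number of frames)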

Segmenting these spectrograms

Now, we will compute NTD on these three examples, and compare the resulting segmentations.

Firstly, we will load the annotation.

In [5]:
# Loading and formatting annotations
annotations = dm.get_segmentation_from_txt(annotation_path, annotations_type)
references_segments = np.array(annotations)[:,0:2]

Tensor generation

In order to perform NTD on these examples, we need to transform the spectrograms into "Time-Frequency-Bar" tensors, i.e. tensors where time is divided into two scales:

  • one for the inner time of bars,
  • one representing each bar in the song.

(see the article "Uncovering Audio Patterns in Music with Nonnegative Tucker Decomposition for Structural Segmentation", soon to be published at the time of writing, for a detailed explanation of this tensor).

In order to construct our tensor, we first need to estimate the downbeats. This is done with the madmom toolbox [4].
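
For reference, here is a minimal sketch of downbeat estimation with madmom (the exact processing inside `dm.get_bars_from_audio` may differ):

from madmom.features.downbeats import RNNDownBeatProcessor, DBNDownBeatTrackingProcessor

# Beat/downbeat activation function, decoded by a dynamic Bayesian network.
activations = RNNDownBeatProcessor()(song_path)
tracker = DBNDownBeatTrackingProcessor(beats_per_bar=[3, 4], fps=100)
beats = tracker(activations)  # array of (time in seconds, position of the beat in its bar)

# Downbeats (bar frontiers) are the beats whose position in the bar is 1.
downbeat_times = beats[beats[:, 1] == 1][:, 0]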

In [6]:
# Estimate the downbeats/bar frontiers
bars = dm.get_bars_from_audio(song_path)

# Convert the annotations to the bar scale: this is used for plotting the annotations on our figures.
annotations_frontiers_barwise = dm.frontiers_from_time_to_bar(np.array(annotations)[:,1], bars)

We then cut the spectrogram at each downbeat, in order to obtain a collection of barwise spectrograms, one for each bar.

This collection of barwise spectrograms will then form the tensor. A problem, though, is that bars can be of different lengths, resulting in slices of different sizes, whereas a tensor needs slices (i.e. the barwise spectrograms taken individually) of the same size.

In these experiments, we decided to find the longest bar, use its length (in number of frames) as the common dimension for all bars, and zero-pad the shorter ones, as in [2].

This method was later changed; see the 3rd notebook for details.
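
As an illustration, here is a minimal sketch of such a barwise tensorization with zero-padding (assuming bars is a list of (start, end) times in seconds; the actual `tf.tensorize_barwise` may differ in details):

import numpy as np

def tensorize_barwise_sketch(spectrogram, bars, hop_length_seconds):
    # Convert bar boundaries (in seconds) into frame indexes.
    frame_bounds = [(int(start / hop_length_seconds), int(end / hop_length_seconds))
                    for (start, end) in bars]
    longest_bar = max(end - start for (start, end) in frame_bounds)

    slices = []
    for (start, end) in frame_bounds:
        bar_slice = spectrogram[:, start:end]
        padding = longest_bar - bar_slice.shape[1]
        # Zero-pad shorter bars to the length of the longest one, as in [2].
        slices.append(np.pad(bar_slice, ((0, 0), (0, padding))))
    # Resulting tensor: frequency x inner-bar time x bars.
    return np.stack(slices, axis=-1)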

In [9]:
stft_tensor_spectrogram = tf.tensorize_barwise(stft_spec, bars, hop_length_seconds)
cqt_tensor_spectrogram = tf.tensorize_barwise(cqt_spec, bars, hop_length_seconds)
pcp_tensor_spectrogram = tf.tensorize_barwise(pcp_spec, bars, hop_length_seconds)

Showing slices of the tensor

In [10]:
# One particular slice of the tensors, representing a particular bar (the 48th bar, exactly).
idx = 48
hide.slice_tensor_spectrogram(idx, stft_tensor_spectrogram,cqt_tensor_spectrogram,pcp_tensor_spectrogram)
Bar 48 of the song, for each representation

NB: note that these bars contain blank frames at the end. This is because, at the time of these experiments, bars weren't resampled to hold the same number of frames; instead, all bars were set to the size of the largest one, and smaller bars were padded with zero frames, as in [2].

Decomposing the tensor

In [8]:
# Rank selection
ranks = [32,32,32]

Now, we decompose these tensors by NTD.

Firstly, we will plot the factor matrices resulting from this decomposition, and secondly, we will plot the autosimilarity of the $Q$ matrix (and of the normalized $Q$), along with the autosimilarity of the signal, for comparison.
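
The autosimilarity of $Q$ is essentially the matrix of pairwise similarities between the bar rows of $Q$. Here is a minimal sketch of how it could be computed (the normalization used in the actual plotting helpers may differ):

import numpy as np

def autosimilarity_sketch(Q, normalize=True):
    # Each row of Q describes one bar of the song.
    if normalize:
        # Normalizing rows to unit norm turns the products into cosine similarities.
        norms = np.linalg.norm(Q, axis=1, keepdims=True)
        norms[norms == 0] = 1  # avoid dividing by zero for empty bars
        Q = Q / norms
    return Q @ Q.T  # (number of bars) x (number of bars)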

STFT

In [12]:
stft_core, stft_factors = NTD.ntd(stft_tensor_spectrogram, ranks = ranks, init = "tucker", verbose = False, hals = False,
                        sparsity_coefficients = [None, None, None, None], normalize = [True, True, False, True])
In [13]:
hide.nice_plot_factors(stft_factors)
In [14]:
hide.nice_plot_autosimilarities(stft_factors[2], stft_tensor_spectrogram, annotations_frontiers_barwise)

CQT

In [15]:
cqt_core, cqt_factors = NTD.ntd(cqt_tensor_spectrogram, ranks = ranks, init = "tucker", verbose = False, hals = False,
                        sparsity_coefficients = [None, None, None, None], normalize = [True, True, False, True])
In [16]:
hide.nice_plot_factors(cqt_factors)
In [17]:
hide.nice_plot_autosimilarities(cqt_factors[2], cqt_tensor_spectrogram, annotations_frontiers_barwise)

PCP/Chromas

NB: the chroma dimension being 12, the rank of $W$ shouldn't exceed 12, as it would probably create redundancy in the factors and be counter-productive for a salient decomposition. It is therefore fixed to 12.

In [18]:
pcp_core, pcp_factors = NTD.ntd(pcp_tensor_spectrogram, ranks = [12, 32, 32], init = "tucker", verbose = False, hals = False,
                        sparsity_coefficients = [None, None, None, None], normalize = [True, True, False, True])
In [19]:
hide.nice_plot_factors(pcp_factors)
In [20]:
hide.nice_plot_autosimilarities(pcp_factors[2], pcp_tensor_spectrogram, annotations_frontiers_barwise)

Segmentation techniques

In order to compare these three decompositions, we will study the segmentations computed from their autosimilarities.

In that context, we have developed two segmentation methods based on the principle of a kernel sliding along the diagonal of the autosimilarity matrix.

These two methods are:

  • novelty computation,
  • convolution computation.

Novelty computation

Firstly, the novelty computation.

This is the method developed in [1].

The idea is to apply a kernel containing only 1 and -1 values, arranged as follows: $\left[ \begin{matrix} 1 & 1 & -1 & -1\\ 1 & 1 & -1 & -1\\ -1 & -1 & 1 & 1 \\ -1 & -1 & 1 & 1 \end{matrix} \right]$ (of size 4 here, but of size 16 in further tests).

For each bar of the song, we center the kernel on its index on the diagonal of the autosimilarity matrix, crop the autosimilarity to be of the size of the kernel, and sum all values obtained by the term-to-term product between both equal-size matrices.

Positive terms represent the inner similarity of the near past and of the near future of this bar. A high score indicates that the past and the future are each internally similar.

Negative terms represent the cross-similarity between these two zones, i.e. the similarity between the near past and the near future of this bar. If this term is high, the past and the future are similar to each other, and the resulting novelty score will be low, as the positive and negative terms will cancel each other out. Conversely, if this term is low, the near past and the near future are dissimilar, and so the novelty score will be high. Hence, this bar is probably a frontier between two very different zones.

From this novelty score, computed for every bar of the song, frontiers are then selected as follows:

  • We pick all peaks of this measure (a peak is defined as a local maximum of the measure, i.e. the novelty cost $n_b$ of bar $b$ is higher than both the cost of the previous bar $n_{b-1}$ and the cost of the following bar $n_{b+1}$).
  • We store the global maximum of novelty on this song, denoted $n^{max}$.
  • We threshold all peaks as a percentage of this maximum $n^{max}$ (for these tests, this percentage is set to 22%, so the threshold will be $0.22 \times n^{max}$).
  • All peaks higher than or equal to this threshold will be considered frontiers. The others will be ignored.

Hence, frontiers are the bar indexes of the peaks of the novelty measure, thresholded relative to its highest value.
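
Here is a minimal sketch of this novelty computation and peak picking (the actual implementation in `as_seg` may handle the borders of the autosimilarity differently):

import numpy as np

def novelty_frontiers_sketch(autosimilarity, kernel_size=16, threshold_ratio=0.22):
    # Checkerboard kernel: +1 on the past/past and future/future blocks, -1 elsewhere.
    half = kernel_size // 2
    kernel = np.ones((kernel_size, kernel_size))
    kernel[:half, half:] = -1
    kernel[half:, :half] = -1

    # Slide the kernel along the diagonal and sum the term-to-term products.
    n_bars = autosimilarity.shape[0]
    novelty = np.zeros(n_bars)
    for b in range(half, n_bars - half):
        novelty[b] = np.sum(kernel * autosimilarity[b - half:b + half, b - half:b + half])

    # Keep local maxima above 22% of the global maximum as frontiers.
    threshold = threshold_ratio * np.max(novelty)
    frontiers = [b for b in range(1, n_bars - 1)
                 if novelty[b] > novelty[b - 1] and novelty[b] > novelty[b + 1]
                 and novelty[b] >= threshold]
    return novelty, frontiers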

Below are shown visualizations of this novelty measure (in black), along with frontiers displayed as vertical bars:

  • In green, estimated frontiers which are correct (True Positives),
  • In orange, estimated frontiers which are incorrect (False Positives),
  • In blue, frontiers from the reference which are not estimated (False Negatives).
In [21]:
hide.nice_plot_novelty_measures(stft_factors[2], cqt_factors[2], pcp_factors[2], annotations_frontiers_barwise)

From these frontiers, we compute some evaluation scores (in French in the legend):

  • True Positives (TP): correctly estimated frontiers,
  • False Positives (FP): estimated frontiers which are incorrect,
  • False Negatives (FN): frontiers from the annotation which are not found,
  • Precision: $\frac{TP}{TP + FP}$ (proportion of all correctly estimated frontiers among all the estimated ones).
  • Recall: $\frac{TP}{TP + FN}$ (proportion of all correctly estimated frontiers among the reference ones).
  • F measure: $\frac{2 \times Precision \times Recall}{Precision + Recall}$.
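
As a quick sanity check (a minimal, hypothetical helper), these formulas applied to the counts of the STFT row of the table below give:

def precision_recall_fmeasure(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

precision_recall_fmeasure(6, 8, 16)  # approximately (0.4286, 0.2727, 0.3333)
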
In [22]:
hide.print_dataframe_results_novelty(stft_factors[2], cqt_factors[2], pcp_factors[2], bars, references_segments)
Out[22]:
                    Vrai Positifs   Faux Positifs   Faux Négatifs   Precision   Rappel   F mesure
STFT  0.5 seconds             6.0             8.0            16.0      0.4286   0.2727     0.3333
      3 seconds               6.0             8.0            16.0      0.4286   0.2727     0.3333
CQT   0.5 seconds            10.0             8.0            12.0      0.5556   0.4545     0.5000
      3 seconds              10.0             8.0            12.0      0.8333   0.6818     0.7500
PCP   0.5 seconds            13.0            12.0            11.0      0.5217   0.5455     0.5333
      3 seconds              13.0            12.0            11.0      0.6957   0.7273     0.7111

Convolution computation

The second segmentation algorithm is the convolution computation.

The main idea is that the darker the zone we analyze, the more similar its bars are to one another, and hence the more likely this zone belongs to a single segment.

For that, we want to evaluate the quantity of dark areas outside the diagonal. Indeed, the diagonal will always be highly similar (as it represents the self-similarity of each bar), and would dominate the other similarity values.

The associated convolution cost is obtained with a kernel of this shape: $\left[ \begin{matrix} 0 & 1 & 1 & 1\\ 1 & 0 & 1 & 1\\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{matrix} \right]$ (of size 4 here).

Hence, the convolution cost for a segment ($b_1$, $b_2$) is the sum of the term-to-term product of this kernel and the associated autosimilarity of this segment, centered on the diagonal and cropped to the boundaries of the segment (i.e. the autosimilarity restricted to the indexes $(i,j) \in [b_1, b_2]^2$). The kernel and the autosimilarity must be of the same size.
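
Here is a minimal sketch of this convolution cost (in practice, the cost may additionally be normalized by the segment size, so that long segments do not automatically dominate):

import numpy as np

def convolution_cost_sketch(autosimilarity, b_start, b_end):
    # Kernel of ones with a zero diagonal, matching the size of the segment [b_start, b_end].
    size = b_end - b_start + 1
    kernel = np.ones((size, size)) - np.identity(size)
    # Autosimilarity restricted to the segment, centered on the diagonal.
    local = autosimilarity[b_start:b_end + 1, b_start:b_end + 1]
    return np.sum(kernel * local)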

Starting from this cost, we aim at finding the sequence of segments maximizing the local convolution costs.

To this end, we use a dynamic programming algorithm (inspired by [3]) which tries to maximize the global sum of the costs of a sequence of segments:

  • We iterate over all bars of the music, in temporal order, considering them as segment ends. Let us denote by $b_e$ one of these bars.
  • We then iterate over all bars which could start a segment ending at $b_e$ (here, every previous bar could start the segment, as long as the segment is shorter than 36 bars). Let $b_s$ be one of these potential starts, and $S_{starts}$ the set of all these bars.
  • We compute the convolution cost associated with the segment [$b_s$, $b_e$], which we denote $c_{s,e}$.
  • Having already iterated over all previous bars, we know the optimal segmentation between the start of the music and the bar $b_s$, considering $b_s$ as the end of a segment. This holds because, once $b_s$ is fixed as a segment end, segmenting the music before $b_s$ is independent from segmenting it after $b_s$. Hence, we have the cost $c_{0,s}$ of the optimal segmentation ending at $b_s$, and we define $c_{0,s,e} = c_{0,s} + c_{s,e}$.
  • Hence, iterating over the $b_s \in S_{starts}$, we compute a set of costs $\{c_{0,s,e}\}$. As a finite numerical set, it admits a maximum for a bar $b_s^{opt} \in S_{starts}$; we define $c_{0,e} = c_{0,s^{opt},e}$ and store $b_s^{opt}$ as the best start for a segment ending at $b_e$. Hence, the optimal segmentation ending a segment at bar $b_e$ is the optimal segmentation ending at $b_s^{opt}$ followed by the segment [$b_s^{opt}$, $b_e$].
  • At the last iteration, we compute the optimal segmentation ending at the last bar of the music, which gives the optimal last segment. Considering its start as the end of the previous segment, we obtain the previous optimal segment, and so on. We thus recover the optimal segmentation by rolling back all the optimal starts that we have computed.

This algorithm then returns the sequence of segments which maximizes the sum of all segment costs, among all possible segmentations of this song.
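
Here is a minimal sketch of this dynamic programming scheme, reusing the convolution cost sketched above (the actual algorithm in `as_seg` adds refinements, such as normalizing costs by segment size):

def dynamic_segmentation_sketch(autosimilarity, max_size=36):
    n_bars = autosimilarity.shape[0]
    best_cost = [0] * (n_bars + 1)   # best_cost[e]: cost of the optimal segmentation of bars [0, e)
    best_start = [0] * (n_bars + 1)  # start of the last segment in that optimal segmentation

    for b_e in range(1, n_bars + 1):
        candidates = []
        for b_s in range(max(0, b_e - max_size), b_e):
            # Cost of the segment [b_s, b_e - 1] plus the optimal segmentation ending at b_s.
            cost = best_cost[b_s] + convolution_cost_sketch(autosimilarity, b_s, b_e - 1)
            candidates.append((cost, b_s))
        best_cost[b_e], best_start[b_e] = max(candidates)

    # Roll back the optimal starts from the end of the song to recover the frontiers.
    frontiers, b_e = [], n_bars
    while b_e > 0:
        frontiers.append(b_e)
        b_e = best_start[b_e]
    return sorted(frontiers)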

NB: more details about this algorithm can be found in the Notebook "Appendix - Focus on the segmentation algorithm".

Below are shown the autosimilarities of the given song, along with frontiers estimated by the convolution algorithm:

  • In green, estimated frontiers which are correct (True Positives),
  • In orange, estimated frontiers which are incorrect (False Positives),
  • In blue, frontiers from the reference which are not estimated (False Negatives).
In [23]:
hide.nice_plot_convolution_measures(stft_factors[2], cqt_factors[2], pcp_factors[2], annotations_frontiers_barwise)

From these frontiers, we compute some evaluation scores (in French in the legend):

  • True Positives (TP): correctly estimated frontiers,
  • False Positives (FP): estimated frontiers which are incorrect,
  • False Negatives (FN): frontiers from the annotation which are not found,
  • Precision: $\frac{TP}{TP + FP}$ (proportion of all correctly estimated frontiers among all the estimated ones).
  • Recall: $\frac{TP}{TP + FN}$ (proportion of all correctly estimated frontiers among the reference ones).
  • F measure: $\frac{2 \times Precision \times Recall}{Precision + Recall}$.
In [24]:
hide.print_dataframe_results_convolution_features(stft_factors[2], cqt_factors[2], pcp_factors[2], bars, references_segments)
Out[24]:
                    Vrai Positifs   Faux Positifs   Faux Négatifs   Precision   Rappel   F mesure
STFT  0.5 seconds             9.0             4.0            13.0      0.6923   0.4091     0.5143
      3 seconds               9.0             4.0            13.0      0.6923   0.4091     0.5143
CQT   0.5 seconds             8.0             7.0            14.0      0.5333   0.3636     0.4324
      3 seconds               8.0             7.0            14.0      0.8000   0.5455     0.6486
PCP   0.5 seconds            18.0             8.0             5.0      0.6800   0.7727     0.7234
      3 seconds              18.0             8.0             5.0      0.7600   0.8636     0.8085

Results on the entire RWC dataset

Finally, we computed these scores on all 100 songs of the RWC Popular dataset, with the convolution algorithm.

Ranks were all set to 32 (except for $W$ with PCP, whose uncompressed dimension is 12, so its rank was set to 12, as above).

From these frontiers, we compute the same evaluation scores as before.

In [25]:
hide.run_and_show_results_on_rwc()
Out[25]:
                    Vrai Positifs   Faux Positifs   Faux Négatifs   Precision   Rappel   F mesure
STFT  0.5 seconds            7.70           10.45           10.65      0.4599   0.4260     0.4204
      3 seconds              7.70           10.45           10.65      0.5973   0.5556     0.5475
CQT   0.5 seconds            8.40           14.42            9.96      0.4132   0.4618     0.4141
      3 seconds              8.40           14.42            9.96      0.5379   0.6045     0.5410
PCP   0.5 seconds           11.12           14.13            7.77      0.4564   0.5922     0.5050
      3 seconds             11.12           14.13            7.77      0.5446   0.7050     0.6019

Conclusion

In conclusion, with these results, we decided:

  • to consider only the PCP for describing music in future experiments,
  • to focus our efforts on the convolution algorithm only, as it does not involve hyperparameters (unlike the threshold in the novelty measure) and because it seems better suited to our context (sparse autosimilarities, where the novelty measure seems unstable due to large white zones, whereas the convolution focuses on dark blocks).

References

[1] Foote, J. (2000, July). Automatic audio segmentation using a measure of audio novelty. In 2000 IEEE International Conference on Multimedia and Expo. ICME2000. Proceedings. Latest Advances in the Fast Changing World of Multimedia (Cat. No. 00TH8532) (Vol. 1, pp. 452-455). IEEE.

[2] Smith, J. B., & Goto, M. (2018, April). Nonnegative tensor factorization for source separation of loops in audio. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 171-175). IEEE.

[3] Sargent, G., Bimbot, F., & Vincent, E. (2016). Estimating the structural segmentation of popular music pieces under regularity constraints. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(2), 344-358.

[4] Böck, S., Korzeniowski, F., Schlüter, J., Krebs, F., & Widmer, G. (2016, October). Madmom: A new python audio and music signal processing library. In Proceedings of the 24th ACM international conference on Multimedia (pp. 1174-1178).